When it comes to data visualization, the more data we have to work with, the clearer the visual communication becomes. Plotting few hundred, even few thousand data points on spatial map renders a fuzzy visual depending on...
“The greatest value of a picture is when it forces us to notice what we never expected to see.” John Tukey
When it comes to data visualization, the more data we have to work with, the clearer the visual communication becomes. Plotting few hundred, even few thousand data points on spatial map renders a fuzzy visual(depending on the type of image). Dropping a million data points on the same spatial map gives a much crisper better detailed visual, and can bring to light what may not be obvious with smaller data.
The Objective for this blog is to attempt a million spatial geocode data mapping with ggplot2. I am also motivated to load test (stress test) the R graphics system on my home computer, and see if the resulting graphics can be visually useful and interpretable .
New York City Taxi and Limousine commission (NYCTL) graciously makes available a geocoded data customer pickup and drop-off data for the Taxis and Limousines in the city. It is sharing the data since 2009 on a monthly bases. Each Month contains, roughly, a million plus pickup and drop off data. Which is just about right size for our test. For this blog, we will be using NY city yellow cap taxi geocode data for the month of January 2016.
As always, we start by loading the packges needed to clean the and map the spatial data. The packages uses were as follows:
rm(list = ls()) #clean slate
# load packages
library("dplyr") # data wrangler
library("ggplot2") # graphic
library("ggmap") # map
library("ggthemes") # theme
library("knitr") # A general-purpose tool for dynamic report generation in RNext up is tidying (cleaning) the raw geocoded data. Because this is a large dataset (large enough for home computer), we will carve out features we will not need, and engineer features that do not exist to make the data better suited for our stress plotting test.
Here we are separating the date, month, hour, minute and year so we can use it drill down and study traffic patterns by day/hour/min if we needed to. Figure 2 & Figure 3, for example, use the newly generated factored variable to demonstrate busy hour count for 140 thousand yellow cab taxi rides in January. Following code clean the data and generate the dataset. A snipt of the data after data wrangling is shown on Table 1.
# download the data from NY taxi and limo commission
# ny_yellow_cab_jan2016 <- "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv"
ny_yellow_cab_jan2016 <- read.csv("./data/ny_yellow_cab_jan2016.csv", stringsAsFactors = FALSE)
set.seed(2022)
# Tidy the data using dplyr select and mutate functions
df_ride_total <-
ny_yellow_cab_jan2016 %>%
select(tpep_pickup_datetime,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude) %>%
mutate (year = as.numeric(substr(tpep_pickup_datetime, 1, 4)),
month = as.numeric(substr(tpep_pickup_datetime, 6, 7)),
date = as.numeric(substr(tpep_pickup_datetime, 9, 10)),
hour = as.numeric(substr(tpep_pickup_datetime, 12,13)),
min = as.numeric(substr(tpep_pickup_datetime, 15,16))
)
# arrange and select relevant variable
df_ride_total <- df_ride_total[,c(6:10,2:5)]
# remove the orig data to free up space
rm(ny_yellow_cab_jan2016)
knitr::kable(head(df_ride_total, 10), digits = 3, caption = "Table1: Here is sample of the data after its wrangled")| year | month | date | hour | min | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|
| 2016 | 1 | 1 | 0 | 0 | -73.990 | 40.735 | -73.982 | 40.732 |
| 2016 | 1 | 1 | 0 | 0 | -73.981 | 40.730 | -73.944 | 40.717 |
| 2016 | 1 | 1 | 0 | 0 | -73.985 | 40.680 | -73.950 | 40.789 |
| 2016 | 1 | 1 | 0 | 0 | -73.993 | 40.719 | -73.962 | 40.657 |
| 2016 | 1 | 1 | 0 | 0 | -73.961 | 40.781 | -73.977 | 40.759 |
| 2016 | 1 | 1 | 0 | 0 | -73.980 | 40.743 | -73.913 | 40.763 |
| 2016 | 1 | 1 | 0 | 0 | -73.994 | 40.720 | -73.966 | 40.790 |
| 2016 | 1 | 1 | 0 | 0 | -73.979 | 40.745 | -73.992 | 40.754 |
| 2016 | 1 | 1 | 0 | 0 | -73.947 | 40.791 | -73.921 | 40.866 |
| 2016 | 1 | 1 | 0 | 0 | -73.998 | 40.724 | -73.996 | 40.688 |
The data is now ready for the stress test! But, before plotting the million geocodes, each with its own longitude and latitude, we use the qmap a wrapper for ggmap function to call and plot a spatial map of New York City on a map from OpenStreetMap.
“OpenStreetMap is built by community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.” And is freely available.
Because Spatial map is static, and we will be using relatively small foot print of NY city, it took tuning of a number of features to not overwhelm and obscure the spatial map of the city with over plot. The consideration included what level of zoom to use for the city map, size & color of each geocode and level of shade and opacity for the map.
The objective is to have the streets visible with as as many of the one million data points on the map captured.The following code chunk munged the tidy data and attempt to plot a million geocode over a spatial map of NY city.
#Get the patial map from openstreetmap
pp_qmap_z13_test_normal <- qmap("new york, ny",
zoom = 13,
darken = .3,
exten = "normal",
maptype = "toner-lite",
source = "stamen")
#Plot the 1 million geocode data on the map
pp1_fig1 <- pp_qmap_z13_test_normal +
geom_point(aes(x = dropoff_longitude,
y = dropoff_latitude),
data = head(df_ride_total, 1000000), #1 million data
colour = "darkolivegreen1",
alpha = .4,
size = .000001) + theme_gdocs()
pp1_fig1Fig1:439,000 geocoded data Visualization
What we see in the figure is 439,000 data points! That is because, qmap dropped geocoded data that fall outside of visible longitude and latitude spatial map. Interestingly the warning ggplot returns for not plotting geocode that fall outside zoomed in coordinate is “missing data”, which is a bit misleading (took me for a spin). But what it means is that the data isn’t plotted because it doesn’t fit the imported Open Street map at the current zoom level.
I had successfully plotted all 1 million at a higher zoom out that captures most of NY city. However, the plots overwhelm the map obscuring the street level view, so that plot isn’t included here. Still, I think, it is impressive to view 439k plots in such a small foot print.
Now that we know ggplot’s extension qmap can handle massive data plotting. Can we improve the data to get better understanding of taxi traffic flow? Yes, it can. Visualizing the data can be used to plan street traffic engineering and planning for street lights modification for time of the day based on traffic patterns.
Here I will modify the table to include a factor variable that will separate the data splitting the 24 hour in a day into four. Morning, Afternoon, Evening and Midnight. We can then plot a traffic map that separately each time quadrant using ggplot facet feature to visualize, at high level, traffic congestion or flow at different parts of the day.
The following chuck takes the data used earlier, and generate a new factored feature that splits the 24 in two four. It then uses the qmap function to plot a spatial map.
###Get the data for the time of the day plot
set.seed(2022)
df_fig2 <- df_ride_total %>%
mutate(TimeOfDay = as.factor(ifelse(hour >= 0 & hour < 6 , "Midnight - 00-06AM", #adding a new categorical data
ifelse(hour >= 6 & hour < 12 , "Morning - 06-12AM",
ifelse(hour >= 12 & hour < 18 , "Afternoon- 12-18PM",
ifelse(hour >= 18 & hour <= 23 ,"Evening - 18-23PM",
hour
))))
))
## plot the data
pp_qmap_z13_test_normal <- qmap("new york, ny",
zoom = 13,
darken = .3,
exten = "normal",
maptype = "toner-lite",
source = "stamen")
pp1_fig2 <- pp_qmap_z13_test_normal +
geom_point(aes(x = dropoff_longitude,
y = dropoff_latitude,
colour = TimeOfDay),
data = head(df_fig2, 325000),
#colour = "darkolivegreen1",
alpha = .4,
size = .00005
) + facet_wrap( ~ TimeOfDay, nrow = 2) + theme_gdocs()
pp1_fig2Fig2:Time of day Visualizing NY Taxi traffic for Jan 2016
This plot is a follow up to the above map. We use the dataset used to generate faceted plot on figure 2, but counts the traffic count for each of the four part of the day. Bar plot is just right for comparing the counts side by side. The plot is for about 140k geo special points, a subset of the full data.
#summarize the data for bar plot
df_fig3 <- head(df_fig2, 400000) %>% select(TimeOfDay) %>%
group_by(TimeOfDay) %>%
summarise(cnt = n())
options(scipen=5)
pp3_fig3 <- ggplot(data = df_fig3, aes(x = TimeOfDay, y = cnt, label = cnt), size=14) +
geom_bar(aes(fill = TimeOfDay), alpha = .6, stat="identity", position = position_dodge(width = 0.90)) +
geom_text(position = position_dodge(.9), vjust = 2, colour="white", fontface = "bold", size = 5 ) +
xlab("") +
theme_minimal() +
theme(axis.text.x = element_text(angle =45, hjust = 1, size = 10))
pp3_fig3Fig3: Bar plot to quantify traffic count by time of the day.
Spatial data visualization is very useful tool to plot massive amount of data in a small foot print. As seen here, ggplot2/ggmap is capable of processing over a million data points even on home personal computer with limited memory and CPU resources. Vsualizing the taxi traffic data can be used to identify busy hours of traffic in the city and make necessary plans for maintenance schedule, traffic engineering and more.
When mapping large set of data in a small foot print, paying attention to features such as opacity, color code and size selection for each plot point, together with zoom level of the static map produce a visual spatial map that is readable and useful.
On my next blog, I will use leaflet an Interactive Web Maps with the JavaScript to map the same One million geocode data from NY taxi commission, and contrast the pro/con with static Spatial map. We will also explore overlaying geojson data using the shape file provided by the NY Taxi And Lemousine commission.
If you need consultation on this kind of work, feel free to contact abiyu.giday@gmail.com.